Because most epithets are not represented by many documents, I will create another feature table, this time with most of the docs cut out.
As the following shows, there is a long tail of epithets with few surviving representatives.
In [1]:
from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_index
import pandas
# Count how many TLG documents carry each epithet
epithet_frequencies = []
for epithet, _ids in get_epithet_index().items():
    epithet_frequencies.append((epithet, len(_ids)))
df = pandas.DataFrame(epithet_frequencies)
df.sort_values(1, ascending=False)
Out[1]:
The specific cutoff of what part of a distribution is the "long tail" is often arbitrary, but in some cases may be specified objectively; see segmentation of rank-size distributions.
So I'll do this semi-objectively: I'm going to cut out any epithet with a negative standard score (that is, below the mean). Thus, I will drop epithets with fewer than 26 representative documents (z-score -0.064414235569960288).
See the following printout for the z-score distribution.
In [2]:
from scipy import stats
# z-score each epithet's document count, sorted high to low
distribution = sorted(list(df[1]), reverse=True)
zscores = stats.zscore(distribution)
list(zip(distribution, zscores))
Out[2]:
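For reference, stats.zscore standardizes each count as (count - mean) / std, so a negative score just means an epithet has fewer representative documents than the mean. A quick sketch to see where the sign flips, reusing the variables from the cell above:

mean_count = sum(distribution) / len(distribution)
print(mean_count)  # counts below this get negative z-scores
print([count for count, z in zip(distribution, zscores) if z < 0][:5])  # first few below-mean counts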
In [22]:
# Make list of epithets to drop
to_drop = df[0].where(df[1] < 26)  # NaN where the epithet has >= 26 docs
to_drop = [epi for epi in to_drop if not isinstance(epi, float)]  # drop the NaN placeholders
to_drop = set(to_drop)
to_drop
Out[22]:
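An equivalent, more direct route is boolean indexing, which never produces the NaN placeholders in the first place (a sketch; assumes columns 0 and 1 still hold the epithet name and document count):

to_drop = set(df[df[1] < 26][0])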
In [23]:
import datetime as dt
import os
import time
from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_of_author
from cltk.corpus.greek.tlg.parse_tlg_indices import get_id_author
import pandas
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
In [24]:
def stream_lemmatized_files(corpus_dir):
    """Yield (doc id, text) for all docs in a dir."""
    user_dir = os.path.expanduser('~/cltk_data/user_data/' + corpus_dir)
    files = os.listdir(user_dir)
    for file in files:
        filepath = os.path.join(user_dir, file)
        with open(filepath) as fo:
            # TODO: rm words fewer than 3 chars long
            yield file[3:-4], fo.read()  # filename minus 3-char prefix and 4-char extension
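The TODO could be handled right here at read time. A minimal sketch, assuming whitespace tokenization is good enough for the lemmatized text (read_filtered is a hypothetical helper, not part of the notebook):

def read_filtered(filepath, min_chars=3):
    # drop tokens shorter than min_chars before vectorizing
    with open(filepath) as fo:
        return ' '.join(word for word in fo.read().split() if len(word) >= min_chars)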
In [25]:
t0 = dt.datetime.utcnow()
map_id_author = get_id_author()
df = pandas.DataFrame(columns=['id', 'author', 'text', 'epithet'])
for _id, text in stream_lemmatized_files('tlg_lemmatized_no_accents_no_stops'):
    author = map_id_author[_id]
    epithet = get_epithet_of_author(_id)
    if epithet in to_drop:
        continue  # skip texts whose epithet is in the long tail
    df = df.append({'id': _id, 'author': author, 'text': text, 'epithet': epithet}, ignore_index=True)
print(df.shape)
print('... finished in {}'.format(dt.datetime.utcnow() - t0))
print('Number of texts:', len(df))
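Note that df.append copies the whole frame on every call, which gets quadratic over thousands of texts. A sketch of the same loop that collects plain dicts and builds the frame once:

rows = []
for _id, text in stream_lemmatized_files('tlg_lemmatized_no_accents_no_stops'):
    epithet = get_epithet_of_author(_id)
    if epithet in to_drop:
        continue
    rows.append({'id': _id, 'author': map_id_author[_id], 'text': text, 'epithet': epithet})
df = pandas.DataFrame(rows, columns=['id', 'author', 'text', 'epithet'])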
In [26]:
text_list = df['text'].tolist()
# Make a list of indices of short texts to drop
# For pres, get distributions of words per doc
short_text_drop_index = [index for index, text in enumerate(text_list) if len(text) <= 500]  # ~100 words
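For the words-per-doc distribution mentioned in the comment, a quick sketch (approximating a word count by whitespace tokens):

word_counts = pandas.Series([len(text.split()) for text in text_list])
word_counts.describe()  # mean, quartiles, min/max words per doc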
In [27]:
t0 = dt.datetime.utcnow()
# TODO: Consider using generator to CV http://stackoverflow.com/a/21600406
# time & size counts, w/ 50 texts:
# 0:01:15 & 202M @ ngram_range=(1, 3), min_df=2, max_features=500
# 0:00:26 & 80M @ ngram_range=(1, 2), analyzer='word', min_df=2, max_features=5000
# 0:00:24 & 81M @ ngram_range=(1, 2), analyzer='word', min_df=2, max_features=50000
# time & size counts, w/ 1823 texts:
# 0:02:18 & 46MB @ ngram_range=(1, 1), analyzer='word', min_df=2, max_features=500000
# 0:02:01 & 47MB @ ngram_range=(1, 1), analyzer='word', min_df=2, max_features=1000000
# max features in the lemmatized data set: 551428
max_features = 100000
ngrams = 1
vectorizer = CountVectorizer(ngram_range=(1, ngrams), analyzer='word',
                             min_df=2, max_features=max_features)
term_document_matrix = vectorizer.fit_transform(text_list) # input is a list of strings, 1 per document
# save matrix
vector_fp = os.path.expanduser('~/cltk_data/user_data/vectorizer_test_features{0}_ngrams{1}.pickle'.format(max_features, ngrams))
joblib.dump(term_document_matrix, vector_fp)
print('... finished in {}'.format(dt.datetime.utcnow() - t0))
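On the generator TODO above: fit_transform accepts any iterable of strings, so the corpus never has to sit in memory as a list. A sketch (note it skips the epithet filtering applied earlier, so its rows would not align with df as-is):

texts = (text for _id, text in stream_lemmatized_files('tlg_lemmatized_no_accents_no_stops'))
term_document_matrix = vectorizer.fit_transform(texts)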
In [28]:
# Reload the saved BoW matrix, to put its vectors into a new df
term_document_matrix = joblib.load(vector_fp)  # scipy.sparse.csr.csr_matrix
In [29]:
term_document_matrix.shape
Out[29]:
In [30]:
term_document_matrix_array = term_document_matrix.toarray()  # dense array; memory-heavy for large vocabularies
In [31]:
dataframe_bow = pandas.DataFrame(term_document_matrix_array, columns=vectorizer.get_feature_names())
In [32]:
ids_list = df['id'].tolist()
In [33]:
len(ids_list)
Out[33]:
In [34]:
dataframe_bow.shape
Out[34]:
In [35]:
dataframe_bow['id'] = ids_list
In [36]:
authors_list = df['author'].tolist()
dataframe_bow['author'] = authors_list
In [37]:
epithets_list = df['epithet'].tolist()
dataframe_bow['epithet'] = epithets_list
In [38]:
# For pres, give distribution of epithets, including None
dataframe_bow['epithet']
Out[38]:
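For the actual distribution rather than the raw column, value_counts with dropna=False keeps the None rows visible:

dataframe_bow['epithet'].value_counts(dropna=False)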
In [39]:
t0 = dt.datetime.utcnow()
# Remove rows whose epithet is None (removes 334 rows)
# On selecting None in pandas: http://stackoverflow.com/a/24489602
dataframe_bow = dataframe_bow[dataframe_bow.epithet.notnull()]
print(dataframe_bow.shape)
print('... finished in {}'.format(dt.datetime.utcnow() - t0))
In [40]:
t0 = dt.datetime.utcnow()
dataframe_bow.to_csv(os.path.expanduser('~/cltk_data/user_data/tlg_bow.csv'))
print('... finished in {}'.format(dt.datetime.utcnow() - t0))
In [41]:
dataframe_bow.shape
Out[41]:
In [42]:
dataframe_bow.head(10)
Out[42]:
In [43]:
# Write dataframe_bow to disk, for fast reuse while classifying
# ~2.3 GB on disk
fp_df = os.path.expanduser('~/cltk_data/user_data/tlg_bow_df.pickle')
joblib.dump(dataframe_bow, fp_df)
Out[43]:
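When classifying later, the saved frame can be reloaded in one call rather than rebuilt:

dataframe_bow = joblib.load(fp_df)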